Workshop Day 1B | 2022-07-25 Jeffrey M. Girard | Pitt Methods
Program II
Functions
Recipes allow chefs to cook up tasty treats
Recipes call for ingredients
Recipes involve one or more steps
Steps transform ingredients into treats
Functions are like customizable recipes
Functions call for inputs (“arguments”)
Functions involve one or more lines of code
Code transforms inputs into outputs
Using functions requires parentheses (usually)
out <- f(in1, in2)
Functions Live Coding
# USECASE: Function can perform a task more easily and readably# TEMPLATE: output <- function_name(input)9^ (1/2)x <-sqrt(9)x# ==============================================================================# LESSON: We can also use functions to transform objectsy <-9sqrt(y)# ==============================================================================# LESSON: We can even use functions to transform the result of calculations2/3round(2/3)# ==============================================================================# LESSON: We can customize what a function does using arguments# TEMPLATE: output <- function_name(argument, argument_name = argument_value)round(2/3, digits =2)round(2/3, digits =3)# ==============================================================================# LESSON: Some arguments are optional because they have default valuesround(2/3) # the default value for digits is 0round(2/3, digits =0)
Vectors
Vectors combine similar objects into a collection
I like to imagine a train pulling multiple cars
A vector is one object with many sub-objects
We refer to each sub-object as an element
Some functions transform each element in turn
Double the amount of cargo in every train car
Some functions summarize across elements
Calculate the total cargo across all train cars
v <- c(1, 2, 3)
Vectors Live Coding
# LESSON: We can combine multiple elements into a vector# TEMPLATE: vector_name <- c(element1, element2, element3)x <-491625# errorx <-c(4, 9, 16, 25)xy <-c(2, 3)y# ==============================================================================# LESSON: We can also combine multiple vectors and elementsc(x, y)c(x, y, 20)# ==============================================================================# USECASE: Math operators will transform each element individuallyx +1x *3x # but again, this won't be saved unless you use assignment# ==============================================================================# USECASE: Some functions will also transform each element individuallysqrt(x)log(x)# ==============================================================================# USECASE: Other functions will summarize the vector with a single numberlength(x)sum(x)mean(x)
Strings
When talking to R, we need a way to distinguish
Object/function names (e.g., the mean function)
Text/character data (e.g., the word mean)
Strings are R’s way of storing text data
Strings can store any characters (no rules!)
Strings are created and displayed with quotes
R has great tools for working with strings
Strings can be collected into vectors
Special functions can transform strings
name <- "John Doe"
Strings Live Coding
# USECASE: Strings are the main way to store character data in Rmy_color <- red # errormy_color <-"red"# correct# ==============================================================================# USECASE: Strings can also store symbols not allowed in object namesdye <-"red#40"dyedyes <-c("red#40", "blue#02")dyes# ==============================================================================# PITFALL: Many operations you can do to numbers won't work for stringsdyes +1# errormean(dyes) # error# ==============================================================================# USECASE: But other operations work for both or even just for stringslength(dyes)nchar(dyes)dyes2 <-toupper(dyes)dyes2
Packages
Cookbooks are a great way to learn to cook
They contain lots of recipes and instructions
Browse an online bookstore for a cookbook
Order it to add it to your personal bookshelf
To use, pull the cookbook off the shelf
Packages are like cookbooks for R
They contain helpful functions and datasets
Browse an online repository for a package
Install it to add it to your personal library
To use, load the package from the library
library("pkg_name")
Packages Live Coding
# USECASE: The stringr package adds a function to fix capitalizationstudents <-c("mary anne", "BENjamin", "Lee")# ==============================================================================# PITFALL: But we can't use that function without installing the packagestr_to_title(students) # error# ==============================================================================# LESSON: Installing a package using RStudio# - RStudio > Extras pane > Packages tab > Install button# ==============================================================================# PITFALL: We also need to load the package before we can use itstr_to_title(students) # error# ==============================================================================# LESSON: We load the package using library()library("stringr")str_to_title(students) #finally works!# ==============================================================================# LESSON: We can also keep our packages updated using RStudio# RStudio > Extras pane > Packages tab > Update button
Wrangle I
Tidy Data Principles
There are many ways to store data
We will be learning the tidy data format
Data should be rectangular
Each variable has its own column
Each observation has its own row
Each value has its own cell
Other Data Advice
Name all variables in the first row
This is called a header row
Avoid merged cells for data storage
These are okay for communication
Avoid empty cells whenever possible
Mark missing data as NA
Avoid formatting-as-data for storage
e.g., non-redundant color-coding
Tidying Example 1
Not Tidy
Name
Ann
Bob
Cat
Dom
Age
13
10
11
11
Weight
56.4
46.8
41.3
43.3
❌ Here, each row is a variable and each column is an observation.
Tidy
Name
Age
Weight
Ann
13
56.4
Bob
10
46.8
Cat
11
41.3
Dom
11
43.3
✔️ Here, each column is a variable and each row is an observation.
Tidying Example 2
Not Tidy
Names:
Ann
Bob
Cat
Dom
Age
Weight
13
56.4
10
46.8
11
41.3
11
43.3
❌ Here, we have data that is not rectangular because the Names variable has its own row.
Tidy
Name
Age
Weight
Ann
13
56.4
Bob
10
46.8
Cat
11
41.3
Dom
11
43.3
✔️ Here, we have made the data rectangular by moving the Names variable to its own column.
Tidying Example 3
Not Tidy
country
year
cases / population
Afghanistan
1999
NA / 19987071
2000
2666 / 20595360
Brazil
1999
37737 / 172006362
2000
80488 / 174504898
China
1999
212258 / 1272915272
2000
213766 / 1280428583
❌ Here, we have merged cells and two values stored in a single cell.
Tidy
country
year
cases
population
Afghanistan
1999
NA
19987071
Afghanistan
2000
2666
20595360
Brazil
1999
37737
172006362
Brazil
2000
80488
174504898
China
1999
212258
1272915272
China
2000
213766
1280428583
✔️ Here, we have un-merged the countries and separated the cases and populations variables into columns.
Tidying Example 4
Not Tidy
student
grade
Amber
91.5
A-
Bristol
86.2
B
Charlene
94.0
A
Diego
89.3
B+
Legend: Psych. Major, Psych. Minor
❌ Here, we have a missing variable name and formatting-as-data.
Tidy
student
psych
grade
letter
Amber
major
91.5
A-
Bristol
minor
86.2
B
Charlene
major
94.0
A
Diego
NA
89.3
B+
✔️ Here, we have added a column for the psych variable, removed the legend, and named the letter variable.
Tidying Example 5
Not Tidy
student
grade
letter
Amber
91.5
A-
Bristol*
94.2
A
Class Summary
As
2
Yay!
Bs
0
*Grade was revised.
❌ Here, we have two types of data in one file and a footnote as data.
Tidy
student
grade
letter
revised
Amber
91.5
A-
FALSE
Bristol
94.2
A
TRUE
letter
count
notes
A
2
Yay!
B
0
✔️ Here, we have split the data into two separate tables and added the revised and notes variables.
Long vs. Wide Format
Wide Format
date
Boeing
Amazon
Google
2009-01-01
$173.55
$174.90
$174.34
2009-01-02
$172.61
$171.42
$170.04
✔️ Here, we have a wide format where each observation is a date.
Long Format
date
stock
price
2009-01-01
Boeing
$173.55
2009-01-01
Amazon
$174.90
2009-01-01
Google
$174.34
2009-01-02
Boeing
$172.61
2009-01-02
Amazon
$171.42
2009-01-02
Google
$170.04
✔️ Here, we have a long format where each observation is the combination of a date and a stock.
Tibbles
R works particularly well with tidy data
We store tidy data in data frames or tibbles
Tibbles are just fancier data frames (i.e., they have a few extra features)
To use tibbles, we need the tidyverse package
Tibbles are constructed from one or more vectors
The vectors must have the same length
They can contain different types of data
Vectors
We start with three separate vector objects that all have the same length.
We set it up so that the \(n\)-th car in each train corresponds to the same observation.
Tibble
Then we combine the vectors into a single tibble (or data frame) object.
Now, as the tibble moves around, the variables always stay together.
Tibbles Live Coding
# SETUP: Install and load the tidyverse package# Extras pane > Packages tab > Installlibrary(tidyverse)# ==============================================================================# LESSON: Create a tibble from vectorsx <-c(10, 20, 30, 40)xy <- x *2-4ymy_tibble <-tibble(x, y)my_tibble# ==============================================================================# USECASE: You can mix different types of vectors in a single tibblefirst_names <-c("Adam", "Billy", "Caitlyn", "Debra")age_years <-c(12, 13, 10, NA)guests <-tibble(first_names, age_years)guests# ==============================================================================# TIP: To save time, you can also create the vectors in the tibble callgradebook <-tibble(grade =c(95, 83, 90, 76),letter =c("a", "b", "a-", "c"))gradebook# ==============================================================================# PITFALL: Don't try to combine tibbles with different lengthsy <-c(1, 2, 3)x <-c("a", "b")tibble(y, x) #error# ==============================================================================# LESSON: However, the exception is R will "recycle" a single valuetibble(y, x ="a")# ==============================================================================# LESSON: You can "extract" a vector from a tibble using $mytibble <-tibble(x =c(1, 2, 3, 4, 5), y ="test")mytibble$xmytibble$y# ==============================================================================# PITFALL: Don't try to extract a vector that doesn't existmytibble$z #error
Importing and Exporting
Data is usually stored in data files
Importing files into R is called reading
Exporting files from R is called writing
A convenient data file type is a CSV
This stands for comma-separated values
A CSV file is easy to share with other people
The tidyverse package can read/write CSVs
Other packages can read/write other types (e.g., readxl, haven, rio, googlesheets4)
Read/Write Live Coding
# SETUP: Load the tidyverse package (if you haven't yet)library(tidyverse)# ==============================================================================# USECASE: Create a tibble and write it to a filegradebook <-tibble(id =c(123, 456, 789),grade =c("A", "B", "A"))gradebookwrite_csv(gradebook, file ="gradebook.csv")# NOTE: You can see the new file in Extras pane > Files tab.# You can open the file in another program (e.g., Microsoft Excel).# You can also email this file to someone else to share it.# ==============================================================================# PITFALL: Don't swap the order of the tibble and the filewrite_csv("gradebook.csv", gradebook) # error# ==============================================================================# USECASE: Read in a file containing dataold_gradebook <-read_csv("gradebook.csv")old_gradebook# NOTE: read_csv() will examine and guess the data type of each variable.# You can tell it the data type of each variable, but that is more advanced.# ==============================================================================# PITFALL: Don't use the read.csv() and write.csv() functionsold_gradebook <-read.csv("gradebook.csv") # not a tibbleold_gradebook